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ABSTRACT 

Gene set analysis using biological pathways has 
become a widely used statistical approach for 
gene expression analysis. A biological pathway 
can be represented through a graph where genes 
and their interactions are, respectively, nodes and 
edges of the graph. From a biological point of view 
only some portions of a pathway are expected to be 
altered; however, few methods using pathway 
topology have been proposed and none of them 
tries to identify the signal paths, within a pathway, 
mostly involved in the biological problem. Here, we 
present a novel algorithm for pathway analysis 
clipper, that tries to fill in this gap. clipper im- 
plements a two-step empirical approach based on 
the exploitation of graph decomposition into a 
junction tree to reconstruct the most relevant 
signal path. In the first step clipper selects signifi- 
cant pathways according to statistical tests on the 
means and the concentration matrices of the graphs 
derived from pathway topologies. Then, it identifies 
within these pathways the signal paths having the 
greatest association with a specific phenotype. We 
test our approach on simulated and two real expres- 
sion datasets. Our results demonstrate the efficacy 
of clipper in the identification of signal transduc- 
tion paths totally coherent with the biological 
problem. 

INTRODUCTION 

Recently much attention has been directed toward the 
study of gene sets in the context of microarray data 
analysis (hereafter GSA). A microarray experiment typic- 
ally provides a list of differentially expressed genes 



(DEGs) (1,2) that represent the starting point of a 
highly challenging process of result interpretation. The 
grouping of genes into functionally related entities is of 
great help for interpreting the results. In this context, stat- 
istical methods for the identification of groups of function- 
ally related genes with moderate, but coordinated, 
expression changes are fundamental to help biologists in 
the process of results comprehension. 

Several GSA tests, both univariate and multivariate, 
have been recently developed (3-7). GSA methods can 
be divided into two broad categories: (i) methods based 
on enrichment analysis performed on a hst of genes 
selected through a gene-level test; and (ii) methods based 
on global and multivariate approaches that define a model 
on the whole gene set (8). In general these two approaches 
are based on two fundamentally different null hypotheses: 
the first type hypothesizes the same level of association of 
a gene set with the given phenotype as the complement of 
the gene set (say, Ql). The second type only considers the 
genes within a gene set and hypothesizes that there is no 
gene in the gene set associated with the phenotype (say, 
Q2) (9). Goeman and Buhlmann (5) termed these 
approaches competitive and self-contained, respectively. 
The main drawbacks with competitive methods are (i) 
the assumption that genes are independent; and (ii) the 
use of a cut-off threshold for the selection of DEGs. 
In this way, many genes with moderate but meaningful 
expression changes are discarded by the strict cut-off 
value, which leads to a reduction in statistical power. On 
the other hand, global and multivariate approaches relax 
the assumption of independence among genes belonging 
to the same gene sets and identify moderate, but 
coordinated, expression changes that cannot be detected 
by the previous approach without depending from any 
arbitrary cut-offs. 

In general, the a priori definition of gene sets is obtained 
from Gene Ontology (GO) (10) information or from bio- 
logical pathways; while genes belonging to a GO category 
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do not have any explicit connections among them (apart 
from being involved in the same function), genes in the 
same pathway are structured in a network with explicit 
biological interactions. Almost all of the self-contained 
approaches, when apphed to biological pathways, use 
merely the list of genes belonging to a pathway, and there- 
fore, although effective, miss the relevant topological in- 
formation contained. 

In the last years, httle effort has been done to consider 
the topological information within the self-contained GSA 
methods. The seminal paper by Draghici et al. (4) 
proposed an interesting approach (called Impact 
Analysis, SPIA (11)) attempting to capture several 
aspects of the data: the fold change of DEGs, the 
pathway enrichment and the topology of signahng 
pathways. In particular, SPIA enhances the impact of a 
pathway if the DEGs tend to lie near its entry points. 
Recently, Isci et al. (6) proposed a Bayesian pathway 
analysis that models each biological pathway as a 
Bayesian network and considers the degree to which the 
model fits the observed experimental data. Both 
approaches test the whole pathway without providing 
the user with the portions of the pathway that are effect- 
ively associated with the phenotype. This is an essential 
information especially when the pathway is large. 

To this end, Laurent et al. (12) developed a graph- 
structured two-sample test of means for problems in 
which the distribution shift is assumed to be smooth on 
a given graph and devised branch and bound algorithms 
to systematically apply their test to the subgraphs of a 
large graph, without enumerating and testing these 
subgraphs one-by-one. Alternatively, Massa et al. (13) 
introduced an innovative approach based on Gaussian 
graphical models that tests both differences in mean and 
in covariance matrices between two experimental condi- 
tions. In particular, the graphical models context is useful 
to decompose the overall graph (obtained from the 
pathway) into smaller parts (cliques), that can be 
explored and tested in detail. 

An alternative approach was proposed by Emniert- 
Streib (7) that proposes to infer the undirected dependency 
graphs representing pathways. Briefly, given two groups, 
Emmert-Streib (7) infers the dependency structure of 
genes belonging to the same GO group using Pearson cor- 
relation and partial Pearson correlation independently on 
both groups, and then tests the similarity of the inferred 
graphs using a graph edit distance and a permutational 
approach. 

In this work, we take the starting point that pathways 
are the best representation of biological experimentally 
validated knowledge of a specific process. In fact, the an- 
notation of a biological pathway is the result of an exten- 
sive effort of hundreds of researchers that manually codify 
their experimental knowledge about a specific biological 
process into a graphical representation. Therefore, we 
decide to consider the topology of the pathway as fixed. 

FoUowing Massa et cd. (13), we propose an empirical 
two-step method, called clipper hereafter, for the 
identification of significant signal transduction paths 
within significantly altered pathways. In particular: 
(i) we generalize the approach of Massa et al. (13) to the 



case of P^n (with P number of genes/variables and n 
number of samples/rephcates), using shrinkage and a 
graphical lasso penalty estimators of the covariance 
matrices; and (ii) by exploiting the structure of a 
junction tree derived from an initial graph, we propose a 
procedure to highhght the portions, called signal paths, of 
a pathway mostly correlated with the phenotype. 

We test our approach on simulated and real expression 
datasets of completely different biological problems 
(cancer and muscle disorders). The obtained results 
provide evidence of the success of our approach in the 
detection of altered pathways and, more importantly, in 
the identification of novel signal paths. We believe that 
clipper could become an important tool for gene ex- 
pression data interpretation. 



MATERIALS AND METHODS 

To implement topology-based GSA using microarray 
data, we need first to convert pathways into gene 
networks, i.e. into a graphical structure in which a node 
represents a simple element like a gene/protein (14). In 
fact, whereas pathway nodes might consist of multiple 
entities such as protein complexes, gene family members 
and chemical compounds, microarrays measure each 
single element of complexes and gene family separately. 
Here, we used graphite (14), a Bioconductor package 
addressing these issues. In general, graphite takes 
pathway information from four different databases 
(Biocarta; KEGG, (15); NCI/Nature Pathway 
Interaction Database, (16); Reactome, (17)) and this in- 
formation is interpreted and opportunely coded by foUow- 
ing specific biologically driven rules. Specifically, given a 
pathway structure, graphite converts it into a gene- 
gene network. We refer to the manual of the package 
for more information on the conversion. 

Pathways may be cyclic or acyclic. The number of 
pathways with cycles is dependent either on the structure 
of the graph or on the number of genes in the array, but 
fortunately is quite small. Given that the graphical infer- 
ence methods assume to have an acyclic graph we prevent- 
ively eliminate self-loops and solve cycles removing the 
weakest edge of the cycle based on expression data (with 
minimum expression profile correlation between nodes) 
(see also (18)). 

Then, an acyclic gene network can be read as a Directed 
Acyclic Graph (DAG). Most inference methods for a 
DAG convert the network to an undirected cycle-free 
graph. Such conversion might require some or aU of the 
foUowing steps: moralization, triangulation, chque identi- 
fication and junction tree construction. Briefly, moraliza- 
tion inserts an undirected edge between two nodes that 
have a child in common and then eliminates directions 
on the edges; triangulation inserts edges in the moralized 
graph so that in the moralized graph all cycles of size >4 
have chords, where a chord is defined as an edge connect- 
ing two non-adjacent nodes of a cycle; chque identification 
identifies the chques of the triangulated graph, i.e. the 
complete subgraphs having all their vertices joined by an 
edge; junction tree construction builds a new hyper-tree 
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having cliques as nodes and satisfying the running inter- 
section property according to which, for any cliques C\ 
and C2 in the tree, every clique on the path connecting 
C| and Ci contains C\ n C2. As an example, consider the 
pathway Chronic myeloid leukemia (CML) from KEGG 
database, see Supplementary Figure SI. 



STEP 1: TESTING THE WHOLE PATHWAY 

In specific conditions, the strength of molecular inter- 
actions within a pathway could be altered, making the 
pathway a dynamic entity. It is therefore reasonable to 
test its dynamic perturbation by statistically testing 
equahty of concentration matrices and mean vectors. 
Here, we assume to have two classes of samples (say 
cases and controls) and we suggest to model the data in 
the two experimental conditions with two graphical 
Gaussian models with the same undirected graph G: 

Mi(G) = {F ~7VKMi,Si),i^i = E7'e^(G)}, 
M2(G) = {F ~ 7VKM2> ^2),K2 = ' e S^{G)}, 

where P is the number of genes (vertices of the graph), ^1 
and K2 are the concentration matrices (inverse of the co- 
variance matrices) of the two models and S^{G) is the set 
of symmetric positive definite matrices with null elements 
corresponding to the missing edges of G. Here, G is the 
graph obtained after transforming the network obtained 
from graphite first into a DAG, and, then, into its 
moral graph. 

In Massa et al. (13), two tests were proposed, one 
for the comparison of the strength of the hnks between 
genes in the two experimental conditions and another 
one to test the differential expression of the pathway. 
In the first case, the hypothesis to be tested is 
Hq : K\ — K2 versus H\ : K\ ^ K2. Testing the dif- 
ferential expression of the pathway is achieved by 
checking equahty of means, i.e. Hq : fii — fj.2 versus 
versus Hi : ^\ ^ /X2. Such test has a different structure 
according to whether the two graphical Gaussian models 
M\{G) and M2{G) are homoschedastic, i.e. they have the 
same covariance matrix, or not. 

Once the graph G is known, the null elements in the 
concentration matrices are identified. On the contrary, 
/Lti, /u,2, £1, ^2 are not known and need to be estimated 
from the data. Here, /xi and 112 are estimated with the 
corresponding sample means. The maximum-hkehhood 
estimates of Ei and E2 can be obtained by using the 
Iterative Proportional Scahng algorithm (IPS, see (19, 
p. 134)) and by taking the sample covariance matrices as 
starting values. The IPS guarantees that the estimated 
matrices belong to S^{G). In this case, a necessary condi- 
tion for the existence of the maximum-likelihood estimate 
is that the number of samples is greater than the cardin- 
ahty, i.e. the number of nodes of the largest chque (19, 
p. 133), a setting that is easily missed in case of gene ex- 
pression data (a typical microarray experiment does not 
exceed the few tens of samples per class, and with the 
advent of deep-sequencing technology, this dimension is 
further reduced to few units). In Supplementary Figure S2, 



we report the distribution of the cardinaHty of the largest 
chque per pathway in four different databases. It is worth 
noting that there are several pathways with clique cardin- 
ahty of several tens of nodes that would not be processed 
by the standard IPS algorithm. 

To estimate the covariance matrix in such circum- 
stances, clipper applies a shrinking procedure in the 
estimation of the sample covariance matrices. Apart 
from increased efficiency, the shrunken estimates have 
the additional advantage of being always positive 
definite and well conditioned. Here, we use a James- 
Stein-type shrinkage estimator, as implemented in 
corpcor R package (20,21). 

The shrunken estimates are passed on to the IPS algo- 
rithm. The use of a shrinkage estimator, however, pre- 
cludes the use of the asymptotic distribution of the log- 
hkelihood ratio test which, in standard settings, has a x^+p 
distribution under the homoschedasticity hypothesis, 
where r is the number of edges and P the number of 
nodes of the graph. Here, we will use a permutational 
approach on the samples. 

Even if the IPS algorithm implemented in qpgraph 

(22) is one of most computationally efficient, in some 
cases (very large and complex pathways) it is highly com- 
putationally demanding (e.g. for diverse pathways the IPS 
algorithms takes even several days to converge) and some- 
times it has problems of convergence. Therefore, with 
clipper, we have also investigated the possibihty of 
computing the maximum-hkelihood estimate of the co- 
variance matrices using the approach of Friedman et al. 

(23) , implemented in the R package glasso, where we 
have specified the indices of entries of the inverse covari- 
ance matrix to be constrained to zero and set the regular- 
ization parameter equal to zero. 

As expected, the estimates of the covariance matrices 
obtained by glasso with no regularization and with 
the IPS algorithm are the same. However, we do not 
find significant improvement in the computational effi- 
ciency and both approaches show the same average com- 
putation time. 

To compare portions of the pathways, with the aim of 
identifying subgroups of genes which appear to drive dif- 
ferences (deregulations) of the entire structure, clipper 
performs the above described tests on each single clique. 
To this end, the moral graph is first triangulated (if 
needed). As the cliques are complete connected subgraphs, 
the IPS algorithm is not required to estimate covariances. 



STEP 2: IDENTIFICATION OF RELEVANT SIGNAL 
PATHS 

Using the structure of the junction tree as a backbone, 
clipper empirically identifies the portions of the tree 
mostly associated to the phenotype. For each pathway 
and the corresponding moralized graph, our approach is 
based on three main steps: (i) construction of the junction 
tree; (ii) identification of paths and corresponding 
sub-paths; and (iii) computation of the relevance of the 
sub-paths as specified below. 
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Figure 1. Toy example of step 2 clipper approach. Panel A, the 
construction of the junction tree with significant cliques in red. 
Panel B, identification of the path.s in the tree. Panel C, identification 
of all the sub-paths within each path. Panel D, selection of the best 
sub-path for each path and cluster analysis for sub-path collapse. 
Panel E Final sub-path selected. 



We define a path as the path connecting the root chque 
with a leaf clique [identified by maximum cardinahty 
search (mcs) algorithm]. For each clique along the paths, 
we consider the P-value of the test on homoschedasticity 
as weight w of the clique. From now on, such quantities 
will loose their probability interpretation although they 
will still reflect the importance of the chque difference 
between the experimental conditions. A weight will be 
considered to be meaningful if it is <a. In our analysis 
we set a = 0.05; however, different cut-offs can be used. 
On each path, we select the portions of the path composed 
by consecutive meaningful cliques containing at most one 
non-meaningful chque. Such portions define the so called 
sub-paths. 

An example of the above described steps is given in 
Figure 1. Panel A represents a junction tree with root 
chque cl and three leaf cliques, i.e. c8, clO and cl2. 



Meaningful cliques are highhghted in red. Panel B repre- 
sents the three paths derived from the junction tree. Panel 
C reports the four sub-paths which can be extracted from 
the path. 

The relevance of each sub-path is computed as follows. 
Let Lj be the length of sub-path /, with / = 1, . . . , /. Given 
the weight Wy of each clique i in the sub-path /, 
ie{\...Lj}, the relevance is calculated according to 
Equation (2). Respecting the ordering of the cliques in 
the sub-path, for each clique / in sub-path j, we compute 
the quantity 



\,...,Lj 



k=\ 



where &kj is defined as 



- log(i%) Wkj < a 
log(l - Wkj) Wkj > a. 



(2) 



(3) 



Then, the relevance Rj of sub-path / is defined to be the 



maximum of S, 



of sub-paths of different 
standardized relevance SR, 



= 1, . . . ,Lj. To compare the relevance 
lengths, we introduce the 



max/(5,y) * m 



(4) 



where m is the position of the niax,(5,7) along the sub-path 
/ Finally, for each path, the sub-path with the maximum 
SRj is selected as its relevant signal path. At the end of this 
procedure, a relevant signal path is identified for each 
path. 

clipper results consist of a number of relevant signal 
paths. In most of the cases, paths and sub-paths are highly 
overlapping (see, e.g., sub-paths lb, 2 and 3 in Figure 1). 
Thus, clipper implements a pruning procedure using a 
cluster analysis approach. We define the dissimilarity 
measure between sub-paths A and B, d{A, B), as 



d{A,B) 




B\ <\B-A\ 
B\ > \B-A\ 



(5) 



where A and B are the sets of genes composing sub-paths 
A and B, \A — B\ is the cardinahty of sets difference and 
1^1 is the cardinahty of the set A (similarly are defined 
\B — A\ and \B\). We perform a cluster analysis and 
collapse sub-paths with d(A, B) < e (taking the sub-path 
with the highest relevance). For our analysis, we set 
E = 0.1; however, clipper allows the selection of a dif- 
ferent threshold. For a numerical example, see panel E of 
Figure 1. 



RESULTS 

Rationale 

Different experimental conditions are usually compared in 
terms of their gene expression mean differences. In the 
univariate case, if a gene increases or decreases signifi- 
cantly its mean expression in one condition with respect 
to the other, it is said to be differentially expressed and it is 
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Table 1. KEGG significant patliways of according to the test on tlie means and the test on the concentration matrices 



ID 


Pathway na.iiie 


Adj. /'-values test 1*^' 


Adj. /"-values test l*" 


SPIA* BPA* GSEA* 
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/\U.IlClCllS JLlllLLlUll 


u 


0 nop + no 


Vpc 
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2 




n 

u 


n nop + no 


Vpc 
X Co 


3 


i_yiiciiCLi Ctii Liiuiii y u uci 111 V 


Q 


0 nop + no 




4 




Q 


0 noe + no 

V/.VJUv^ 1 \J\J 




5 




Q 


0 noe + nn 


Yes 


o 


Rcgulcitioii ot Hctin cytosk-clctoii 


n 
u 


0 nop -1- no 


Vpc 
I Co 


7 

1 


V daL Ulal alllUU Lil 111 LlaL.lC CUllLl dt-LlUll 


n 


0 nop + no 




g 


VV IIL al^llclllllg pd.lllWciy 


n 


0 nop + no 


X Ca X Co 


Q 

7 


NtitLU'cil kiUcr ccll-niGcii cited cytotoxicity 


u 


J. /OC — IM- 




10 


Riiptprial invfi»sinn nf pnithplinl ppIIq 


0 


7.68e — 14 




\ \ 


1V/Tp1 a nr\ tTPTiPci c 
iviciciinjgciitais 


Q 


1.54e — 13 


Yes 


12 


llgllL JLlllLLlUll 


VJ 


8 '^d.p 1 1 
0. JtC — IZ, 


Vpc 
X Co 


13 


Tnll-lilt p rpppnt AT qi ov\ alitKr nathwav 


0 


1 f,8e — 10 


Yes 


14 


V 11 dl lliy ULdl Ul llJi 


n 


z.ooc — lU 


Yes 


15 


A ynn tri n Hm npp 


Q 


1.31e — 09 




16 


Odadl LCll Ldl LlllUllld 


n 
\) 


5 90e _ 09 


I 1. Co 


1 7 


Insulin sigiieiling pcithwciy 


n 
u 


1 "^Qp OS 

1 . jye — yjo 


Vpc 


1 8 

1 o 


Acute myeloid leuk.enii3. 


n 


1 AAf OS 




19 


Neurotrophiii sisnaliiig peitliway 


0 


6.69e - 08 






Glycolysis/gluconeogenesis 


n 
u 


s nop OS 

o.UUc — Uo 




21 


kjllliicllUbls 


u 


7 Cidf 07 




22 


'Tf^l-H-npt'i cifTnaliTitT r^'j t \/ 
1 \j 1 ucid dixiidiiiig udiiiwdy 


Q 


3.71e — 07 




23 


LC LlK.UL,y LC LI dllaCllQU LllCtldl 1111^1 d LKJll 


n 


Q 40p 07 


Vpc 
X Ca 


24 


T cell receptor signaling pathway 


0 


3.37e-06 




25 


Chronic myeloid leukemia 


0 


4.40e - 06 




26 


Leishmaniasis 


0 


1.65e-05 




27 


Fructose and mannose metabolism 


0 


1.78e-05 




28 


Systemic lupus erythematosus 


0 


6.32e-05 




29 


Pyruvate metabolism 


0 


1.71e-04 




30 


Fc gamma R-mediated phagocytosis 


0 


6.34e-03 


Yes 


31 


RIG-I-like receptor signaling pathway 


0 


7.03e-03 


Yes 


32 


Pathogenic Escherichia coli infection 


0 


8.13e-03 


Yes Yes 


33 


B cell receptor signaling pathway 


0 


2.77e-02 





In red those pathways including BCR and/or ABL genes, in blue those pathways coherent with experimental evidences. 

"Test on the mean with Bonferroni correction. 

''Test on the concentration matrices with Bonferroni correction. 

*SPIA, BPA and GSEA resuhs using raw P-value^ 0.1. 



assumed to be involved in the biological process under 
study. It is easy to generalize the previous concept to the 
multivariate setting; if a gene set changes significantly its 
multivariate mean expression in one condition with 
respect to the other, it is said to be differentially expressed. 
However, the difference in mean expression levels does not 
necessarily result in a change of the interaction strength 
among genes. In this case, we will have pathways with 
significant altered mean expression levels but unaltered 
biological interactions. 

On the contrary, if transcripts abundances ratios are 
altered, we expect a significant alteration not only of 
their mean expression levels, but also of the strength of 
their connections, resulting in pathways with completely 
corrupted functionahty. Therefore, to look for pathways 
strongly involved in a biological process, we should look 
at pathways with both mean and variance significantly 
altered. 

clipper is based on a two-step approach: (i) it selects 
pathways with both covariance matrices and means sig- 
nificantly different between experimental conditions; and 
(ii) on such pathways, it identifies the sub-paths mostly 
associated to the phenotype. clipper is freely available 



as an R package at http://romualdi.bio.unipd.it/ in 
Software section. 

In this section, we provide (i) a simulation study to test 
the specificity of our approach; and (ii) an application of 
clipper on two real datasets along with a comparison 
with GSEA (3) (non-topological method), SPIA (11) and 
BPA (6) (topological methods). Differently from BPA, 
SPIA requires a list of DEGs. Here, we used empirical 
Bayes test (1) to identify DEGs (implemented in limma 
Bioconductor package). On real datasets, clipper step 2 
will be applied to one of the pathways identified in step 1. 

Simulation 

As some paths may be declared relevant by clipper step 
2 simply as a consequence of type I errors in clipper 
step 1, we developed a simulation study. For 10000 runs, 
we generated two samples, one for each condition, from 
the same graphical model M(G) — {Y ~ N23(fi,'E), 
e 5''^(Cr)} and tested equahty of concentration 
matrices and mean vectors for the whole pathway and 
all the cHques. Under this scenario, at the nominal level 
a = 0.05, we expected: (i) for each test, a number of re- 
jections around 5%; (ii) a scattered location along the 
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Figure 2. clipper results on chronic myeloid leukaemia (CML) KEGG pathway. Panel A, junction tree with significant cliques in blue. The highest 
scored sub-path is highhghted with blue border. Panel B, CML pathway with genes belonging to significant cliques in red or green according to their 
expression mean differences (translocation positive versus negative patients). Panel C, the original KEGG CML layout with complexes belonging to 
the sub-path identified colored according to their expression. 
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Table 2. List of significant KEGG and Reactome pathways according to the test on the means and the test on the concentration matrices 





Pathway name 


Adj. P- values test 1" 


Adj /'-values test 2^ 


1 


KEGG: RIG-I-like receptor signaling pathway 


0 


5.68e- 13 


2 


Reactome: GRB2:SOS provides linkage to MAPK signaling for integrins 


0 


3.22e- 13 


3 


Reactome: DCC-mediated attractive signaling 


0 


8.50e-09 


4 


Reactome: Intrinsic pathway for apoptosis 


0 


1.07e-06 


5 


Reactome: pl30Cas linkage to MAPK signaling for integrins 


0 


1.37e-06 


6 


Reactome: TRAIL signaling 


0 


1.50e-02 


7 


Reactome: signal regulatory protein (SIRP) family interactions 


0 


2.00e - 02 


8 


Reactome: activation of BH3-only proteins 


I 


2.16e-03 



BPA cannot be performed on Reactome database and GSEA does not identify significantly deregulated pathways, neither with Bonferroni adjusted 

P-values nor with nominal P-values. 

"Test on the mean with Bonferroni correction. 

''Test on the concentration matrices with Bonferroni correction. 



junction tree of the statistically significant cliques, 
(ii) implies that the length of significant paths identified 
by clipper step 1 should be rarely (about 5%) longer 
than 1. Results shown in Supplementary Table SI demon- 
strate that our procedures have an excellent control of 
type 1 error in step 1 and very appreciably respond to 
expectations in step 2, even with exceptionally low 
sample sizes. 

Application: ALL dataset 

The dataset we use for this comparison was pubHshed by 
Chiaretti et al. (24) and characterizes gene expression sig- 
natures in acute lymphocytic leukemia (ALL) cells 
associated with known genotypic abnormaUties in adult 
patients. Several distinct genetic mechanisms lead to 
ALL malignant transformations deriving from distinct 
lymphoid precursor cells that have been committed 
to either T-lineage or B-lineage differentiation. 
Chromosome translocations and molecular rearrange- 
ments are common events in B-lineage ALL and reflect 
distinct mechanisms of transformation. The relative 
frequencies of specific molecular rearrangements differ in 
children and adults with B-hneage ALL. The BCR break- 
point cluster region and the c-abl oncogene 1 (BCR/ABL) 
gene rearrangement occurs in about 25% of cases in adult 
ALL, and much less frequently in pediatric ALL. 

Data are available at the Bioconductor site (http:// 
www.bioconductor.org/help/publications/2003/Chiaretti/ 
chiaretti2/). Expression values, appropriately normalized 
according to robust multiarray analysis (rma) and 
quantile normalization, derived from Affymetrix single 
channel technology, consist of 37 observations from one 
experimental condition («i = 37, BCR; presence of BCR/ 
ABL gene rearrangement) and 41 observations from 
another experimental condition {rii = 41, NEG; absence 
of rearrangement). Probes platform have been annotate 
using EntrezGene custom CDF version 14 (25). 

Step 1 results 

Given the presence of the BCR/ABL chimera, we expect 
that all the pathways including BCR and/or ABLl will be 
impacted. The KEGG pathways found to be significantly 
involved (Bonferroni adjusted /"-value < 0.05) in the 
difference between translocation positive and negative 



patients by clipper step 1 analysis are reported in 
Table 1. Firstly, it is worth noting that with an adjusted 
/"-value < 0.05 clipper identifies as significantly dereg- 
ulated almost all (7 out of 9 P-value = 3.279616e-06) 
pathways including BCR and/or ABL genes (in red 
Table 1). On the contrary, GSEA, SPIA and BPA did 
not find any significantly altered pathways using 
Bonferroni adjusted P-value < 0.05. However, if uncor- 
rected P-value<0.1 is considered, SPIA and GSEA 
identify 2 out of 9 (P-value = 0.18) pathways, including 
either ABL and/or BCR genes (Table 1), while BPA 
identifies only one. 

Moreover, most of the other pathways identified by 
clipper are strongly coherent with experimental 
findings on BCR/ABL mechanism. In fact, many signahng 
proteins have been shown to interact with BCR/ABL 
through various functional domains/motifs (e.g. GRB2, 
CRKL, CRK, SHC, 3BP2, ABL-interacting protein 1 
and 2, and CRK-associated substrate (CAS)), and/or to 
become phosphorylated in BCRABL-expressing cells (e.g. 
CRKL, CRK, SHC, GAB2, CBL, CAS, the p85 subunit 
of PI3K, FES, paxillin and talin). These proteins, in turn, 
activate a range of signaling pathways identified by 
clipper (in blue Table 1) that activate proteins such as 
RAS, PI3K,A KT, JNK, SRC family kinases, protein and 
lipid phosphatases, and their respective downstream 
targets, as well as transcription factors such as the 
STATs, nuclear factor-kB and MYC. Most of these 
findings were observed from experiments in vitro 
systems, or from studies of the properties of cells derived 
from leukaemia patients with particular stages of 
disease (26). 

Step 2 results 

Focusing on CML pathway that contains exactly BCR/ 
ABL fusion gene, clipper identifies a sub-path that fits 
perfectly with experimental findings. In particular, the 
highest scoring sub-path is that one starting from BCR/ 
ABL toward the oncogene TP53 (Figure 2). It is known, in 
fact, that the BCR/ABL fusion protein in CML cells, 
promotes the accumulation of p53 and that, in contrast 
to the activation of p53 by c-Abl, its oncogenic form, 
BCR/ABL, counteracts the growth inhibitory activities 
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Figure 3. Intrinsic pathway of apoptosis. Panel A, junction tree with 
significant clique in blue. The highest scored sub-path is highlighted 
with blue border. Panel B, native pathway with genes belonging to 
significant cliques in red or green according to their expression mean 
differences (LGMD2A versus LGMD2B). 



of p53 by modulating the p53-MDMD2 loop. Thus, it 
appears that by modulating the p53-MDMD2 loop, 
c-Abl and its oncogenic forms critically determine the 
type and extent of the cellular response to DNA 
damage (27). 

It is worth noting that the signal path obtained by 
clipper would have not been identified using just the 
Hst of DEGs belonging to this pathway. Only ABLl and 
NFKBIA, in fact, are identified by empirical Bayes test (1) 
as differentially expressed with FDR < 0.1. 

Application: LGMDs dataset 

Limb girdle muscular dystrophies (LGMDs) are a group 
of muscular diseases with heterogeneous chnically and 
genetically features. Globally, they present progressive 
muscle weakness caused by progressive muscle waste 
combined with an increase of muscle connective tissue. 
We analyse a dataset containing 10 LGMD type 2 A 
(LGMD2A) and 10 type 2B (LGMD2B) samples (28). 

LGMD2A is caused by a mutation in the gene calpainB 
(29) that codes for a cysteine protease that cleaves 
cytosckeletal and myofibrillar proteins and serves to 
maintain proper functions and structure of the sarcomere 



(30). LGMD2B is caused by a mutation in the gene 
dysferhn that codes for a sarcolemma protein involved 
in membrane repair and muscle regeneration (31). 
Together with desmoyokin (AHNAK), dysferlin forms 
the dysferlin protein complex involved in the maintenance 
of the sarcolemma integrity (32). AHNAK is also a sub- 
strate of calpain3, and after the cleavage AHNK is not 
able to bind dysferlin anymore confirming the mutual in- 
fluence that calpain3 and dysferlin protein exert each other 
(32). Thus, we expect few molecular differences between 
these pathologies. 

Step 1 results 

In the analysis for LGMDs, we used Reactome and 
KEGG databases stored in graphite. Firstly, we identify 
the involvement of Apoptosis (e.g. pathways 1 and 4 in 
Table 2). In case of stress signals, proapoptotic BCL-2 
family proteins are activated and subsequently interact 
with and inactivate antiapoptotic BCL-2 proteins. This 
interaction leads to the destabilization of the mitochon- 
drial membrane and release of apoptotic factors that 
reduce muscle cell survival in LGMD2A (33). Moreover, 
clipper results help in formulating novel hypothesis on 
this case study. Specifically, we found many pathways 
referred to MAPK signaling (e.g. pathways 2 and 5 in 
Table 2). Our results seem in agreement with (34) that 
recently showed the role of MAPK signahng pathway in 
the LMNA-associated degenerative process and the simi- 
larity of the regulatory processes between LGMD2A, 
LGMD2B and LMNA-associated muscular dystrophy re- 
gardless of the causative gene. 



Step 2 results 

With the step 2 of clipper analysis, we are able to reach 
an even deeper level of accurateness. We focused on 
Intrinsic pathway of apoptosis. Figure 3 shows that the 
signal sub-path identified by clipper include BAX, 
BID, BCL2 and BAD that play a central role in leading 
to apoptosis. 



CONCLUSIONS 

Here, we present clipper, a novel two-step empirical 
method for pathway analysis able to dissect the complex- 
ity of a pathway identifying the portions mostly associated 
to the biological process studied. 

Our empirical approach is fundamentally different from 
previous ones for two reasons. We take into account not 
only expression changes but also differences in transcript 
concentrations, allowing the identification of pathways 
with their functionality completely corrupted. We are 
able to go to the finest details of the pathway structure, 
identifying the signal transduction path that is the princi- 
pal cause of the pathway deregulation. 

clipper efficacy has been vahdated on two expression 
datasets of completely different biological problems 
(cancer and muscle disorders). In both cases, we 
obtained interesting results strongly coherent with 
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experimental findings available in literature. Moreover, 
our results demonstrate the utility of clipper not only 
in the result comprehension but also in driving the experi- 
menter in formulating new hypothesis. We therefore 
believe that clipper would become an important tool 
for gene expression data interpretation. 

SUPPLEMENTARY DATA 

Supplementary Data are available at NAR Onhne: 
Supplementary Table 1 and Supplementary Figures 1 
and 2. 
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APPENDIX 

Gaussian graphical models 

We report here a concise review of Gaussian Graphical 
Models theory. A graph G is a pair G — (V, E), where Fis 
a finite set of vertices and the set of edges £ c Fx F is the 
set of ordered pairs of distinct vertices. If both (m, v) e E 
and (v, m) e E, the edge {u, v) is said to be undirected. If 
(m, v) e E but (v, m) ^ E, the edge (v, u) is said to be 
directed. 

A DAG is a directed graph without cycles. Given a 
DAG, a moral graph is the undirected graph obtained 
from the DAG by adding undirected edges between aU 
pairs of vertices that have a child in common (if they 
are not already present) and then by rendering all edges 
undirected. 

If G is undirected, then a subgraph is complete if all its 
vertices are joined by an edge. Any complete subgraph is a 
clique. A maximal complete subgraph (with respect to c) 
is a maximal clique. In a graphical models context only 
maximal cliques are relevant in estimation problems and 
therefore we will always use the term clique with the 
meaning of maximal cHque. 

A triple {A, B, C) of disjoint subsets of V of an undir- 
ected graph G is a decomposition of G if F = ^ U 5 U C, C 



is a complete subset of F and C separates A and B. An 
undirected graph is decomposable if either it is complete or 
it possesses a proper decomposition {A, B, C) such that 
both subgraphs Gaub and Gbuc are decomposable. 

A triangulated graph (or chordal graph) is an undirected 
graph with the property that every cycle of length « > 4 
has two non-consecutive vertices that are adjacent. An 
important result is that an undirected graph is decompos- 
able if and only if it is triangulated (19, p. 9). If a graph is 
not triangulated, it is possible to add extra edges so that 
the resulting graph is triangulated. It is well known that 
the problem of obtaining an optimal triangulation (i.e. 
finding the smallest number of edges to be added) is 
NP-hard and therefore we rely on the heuristic algorithm 
developed in the R package gRbase, implemented in the 
function triangulate. 

K junction tree of cliques for a graph G is a tree having 
the cliques of G as nodes and satisfying the running inter- 
section property according to which, for any cliques C\ 
and C2 in the tree, every clique on the path connecting 
C\ and Ci contains C\ n C2. Decomposabihty is a neces- 
sary and sufficient condition for the existence of a junction 
tree. We build a junction tree by finding a running inter- 
section property ordering of the cliques via the maximum 
cardinahty search algorithm (mcs, implemented in the rip 
function of the gRbase R package). 

A Gaussian graphical model with dependence graph 
G = {V, E) where \ V\ — P can be defined as the multivari- 
ate normal distribution M(G) = { F ~ Np(fj., E), 
K='Z-^ e S^(G)} where 5+(G) is the set of symmetric 
positive definite matrices with null elements corresponding 
to the missing edges of G. 



